QCD on the BlueGene/L Supercomputer

G. Bhanot, D. Chen, A. Gara, J. Sexton, and P. Vranas * a 

a IBM T.J. Watson Research Center, Route 134, Yorktown Heights, NY 10598, USA. 

In June 2004 QCD was simulated for the first time at a sustained speed exceeding 1 Teraflops on the BlueGene/L
supercomputer at the IBM T.J. Watson Research Lab. The implementation and performance of QCD on
BlueGene/L are presented.
1. Introduction

Lattice gauge theory has been intimately connected
to numerical simulations and computer hardware
from the beginning. More specifically, lattice
field theories lend themselves naturally to
numerical simulations on massively parallel
supercomputers. These machines have defined the
landmarks for computational speed through the
years. There have been only a few, and each can
be remembered by name and by the new window
it opened to research in field theory.
Figure 1. Two flavor dynamical Wilson HMC,
$\beta = 5.2$, $\kappa = 0.18$, $V = 32^3 \times 64$.

In this sense, when in June 2004 lattice QCD
ran on a 1024 node 5.6 Teraflop prototype
BlueGene/L (BG/L) supercomputer at the IBM T.J.
Watson Research Lab for eight hours, at a sustained
speed exceeding 1 Teraflops, another landmark
was crossed, signaling the next generation of
lattice gauge theory calculations. Figure 1 shows
the thermalization of the chiral condensate in
that simulation, starting from an ordered gauge
configuration. Furthermore, using the full 2048
node 11.5 Teraflop prototype, another dynamical
run was done. It sustained more than 2 Teraflops
for 3 hours ($V = 64^3 \times 16$, $\beta = 5.1$
and $\kappa = 0.180$). At the time of this writing,
the available QCD code can sustain up to about
19% of peak.

For details about BG/L see for example [1]. For
other QCD related supercomputers see for example
[2]. The largest planned BG/L installations
will be a 115 Teraflops 20K node machine at IBM
Watson and a 367 Teraflops 64K node machine
at Lawrence Livermore National Lab in 2005. In
the June 2004 ranking of the world's fastest
supercomputers a BG/L machine (located at IBM
Rochester) ranked number 4 and another one
(located at IBM Watson) ranked number 8.

Programming Lattice QCD to run efficiently on 
these large machines has always been a challenge. 
Here we describe the relevant BG/L hardware 
and how QCD is coded for it. We also present 
performance and scaling measurements. 

2. The microchip 

Figure 2 shows an abstraction of the BG/L chip
(node). The chip has two embedded IBM 440
CPU cores. Each core has an enhanced floating
point unit capable of two multiply/add (MADD)
instructions per cycle. Each MADD counts as two
floating point operations, so the chip is capable
of a peak of 8 floating point operations per
cycle. At the operating speed of 700 MHz the
chip delivers 5.6 GFlops of peak speed. The
memory hierarchy is shown in Figure 2.
Figure 2. An abstraction of the BG/L node chip.
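
The peak figure quoted for the chip follows from simple arithmetic. The
short Python sketch below reproduces it from the numbers given in the
text; the constant and function names are illustrative and do not come
from any BG/L software.

```python
# Peak floating point rate of one BG/L chip, from the figures quoted in the text.
CORES_PER_CHIP = 2        # two embedded CPU cores
MADD_PER_CYCLE = 2        # multiply/add instructions per core per cycle
FLOPS_PER_MADD = 2        # one multiply plus one add
CLOCK_HZ = 700e6          # 700 MHz operating speed


def flops_per_cycle(cores=CORES_PER_CHIP):
    """Floating point operations per clock cycle for the given core count."""
    return cores * MADD_PER_CYCLE * FLOPS_PER_MADD


def peak_gflops(cores=CORES_PER_CHIP):
    """Peak speed in GFlops at the 700 MHz clock."""
    return flops_per_cycle(cores) * CLOCK_HZ / 1e9


print(flops_per_cycle())   # operations per cycle for the full chip
print(peak_gflops())       # peak GFlops for the full chip
```

Calling `peak_gflops(1)` gives the peak of a single core, which is the
relevant bound when only one core performs computations.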

Several network controllers are built into the
chip. The main network connects the nodes in 
a 3-dimensional torus grid with nearest neigh- 
bor high speed interconnects. The hardware im- 
plements a sophisticated dynamic virtual cut- 
through network. Packets can be sent from any 
node to any other node without CPU interven- 
tion. Furthermore, the nodes are also connected 
via a global tree network that can efficiently per- 
form global operations as well as file I/O. Also, 
the nodes are connected via separate interrupt 
and control networks. The system is fully sym- 
metric with respect to the two CPU cores. 
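
To illustrate what nearest neighbor connectivity on a 3-dimensional
torus means, the following Python sketch lists the six neighbors of a
node with periodic wrap-around. The function name is hypothetical, and
the sketch models only the grid topology, not the cut-through routing
done by the hardware.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of `coord` on a periodic 3D grid.

    `coord` is an (x, y, z) node position and `dims` the torus extent
    in each direction. Coordinates wrap around at the edges.
    """
    neighbors = []
    for axis in range(3):
        for step in (+1, -1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


# A corner node of an 8 x 8 x 16 torus still has six neighbors,
# because the links wrap around to the opposite faces.
for n in torus_neighbors((0, 0, 0), (8, 8, 16)):
    print(n)
```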

3. The BlueGene/L supercomputer 

Two chips are assembled on a compute-card. 
Sixteen compute-cards are connected with con- 
nectors on a node-card. In case of a chip failure 
the compute-card can be easily replaced. Thirty- 
two node-cards are plugged into 2 vertical mid- 
plane cards. The 1024 node assembly is housed 
in a "refrigerator" sized cabinet (rack). Many 
racks can be connected with cables to form large 
systems. Figure 3 shows the 2 rack prototype at 
the IBM Watson lab. Also, a single rack can be 
configured as an 8 x 8 x 16 torus or as two 8 x 8 x 8
torus grids without having to be re-cabled.
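
The packaging numbers above can be checked against the torus shapes
with a few lines of Python. This is only a consistency check of the
counts quoted in the text; the names are illustrative.

```python
from math import prod

CHIPS_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_CARD = 16
NODE_CARDS_PER_RACK = 32


def nodes_per_rack():
    """Number of chips (nodes) in one rack, following the packaging hierarchy."""
    return CHIPS_PER_COMPUTE_CARD * COMPUTE_CARDS_PER_NODE_CARD * NODE_CARDS_PER_RACK


def torus_nodes(shape):
    """Number of nodes in a torus with the given extents."""
    return prod(shape)


print(nodes_per_rack())              # nodes in one rack
print(torus_nodes((8, 8, 16)))       # one torus spanning the full rack
print(2 * torus_nodes((8, 8, 8)))    # two half-rack tori
```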

4. QCD on the hardware 

QCD can use the two CPU cores of the BG/L
chip in two basic modes. 1) Co-processor mode,
where CPU0 does all the computations and CPU1
does all the communications (including MPI etc.).
The 4th direction is internal to CPU0.
Communications can overlap with computations but
the peak performance is then limited to
5.6/2 = 2.8 GFlops. 2) Virtual node mode, where
CPU0 and CPU1 act as independent "virtual nodes",
each with its own memory space. Each one does both
computations and communications. The 4th
direction is along the two CPUs, which communicate
via common memory. Computations and
communications cannot overlap but the peak
performance is the full 5.6 GFlops.
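
The difference between the two modes can be illustrated with a
deliberately crude timing model. The Python sketch below is not taken
from the paper: it assumes that the communication time is the same in
both modes and that the two cores split the computation evenly in
virtual node mode. The function names are hypothetical.

```python
def coprocessor_time(t_compute, t_comm):
    """Co-processor mode: one core computes while the other communicates.

    `t_compute` is the time a single core needs for the node's arithmetic
    and `t_comm` the time for its communications. The two overlap, so the
    longer one sets the total.
    """
    return max(t_compute, t_comm)


def virtual_node_time(t_compute, t_comm):
    """Virtual node mode: both cores compute, then communicate without overlap.

    Assumes the arithmetic is split evenly between the two cores.
    """
    return t_compute / 2 + t_comm


# Compute-dominated case: virtual node mode is faster in this model.
print(coprocessor_time(10.0, 2.0), virtual_node_time(10.0, 2.0))
# Communication-dominated case: co-processor mode is faster in this model.
print(coprocessor_time(4.0, 6.0), virtual_node_time(4.0, 6.0))
```

Within this model virtual node mode wins whenever the communication
time is less than half of the single-core computation time.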




Figure 3. The 2,048 node, 11.5 Teraflops Blue- 
Gene/L prototype at the IBM T.J. Watson lab. 

5. The Wilson operator 

The Wilson operator is coded in virtual node
mode. The standard spin projection and
reconstruction algorithm is used. The multiplication
of the gauge field with a spinor is done using
the spin projected two component spinors. Also,
only two component spinors need to be
transferred between nodes. All computations use the
double multiply/add instructions. Computations
overlap with load/stores. Local performance is
bounded by memory access to L3. Communications
do not overlap with computations or memory
access. Because of the above there is an
interesting tradeoff: for a small local lattice
size one has fast L1 memory access but more
communications (larger surface to volume ratio).
For a large local lattice size one has slower L3
memory access but fewer communications. The
innermost kernel is written in "pseudo-assembly"
(C or C++ inline assembly).
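
The idea behind spin projection and reconstruction can be sketched in
a few lines of Python. The example below is only an illustration under
stated assumptions, not the BG/L kernel: it uses a single gamma matrix
of the form [[0, A], [A^dagger, 0]] with A = i sigma_x (the basis
choice is an assumption), a random complex 3x3 matrix in place of a
gauge link, and hypothetical function names. Because (1 - gamma) has
rank two, only the two upper spin components need to be multiplied by
the color matrix (or sent to a neighbor); the lower two follow from
them.

```python
import random

# Off-diagonal 2x2 block of the illustrative gamma matrix (A = i * sigma_x).
A = [[0, 1j], [1j, 0]]
A_DAG = [[0, -1j], [-1j, 0]]


def random_color_matrix():
    """Random complex 3x3 matrix standing in for a gauge link."""
    return [[complex(random.random(), random.random()) for _ in range(3)]
            for _ in range(3)]


def random_spinor():
    """Random spinor: 4 spin components, each with 3 color components."""
    return [[complex(random.random(), random.random()) for _ in range(3)]
            for _ in range(4)]


def matvec3(u, v):
    """Multiply a 3x3 color matrix by a 3-component color vector."""
    return [sum(u[a][b] * v[b] for b in range(3)) for a in range(3)]


def project(psi):
    """Upper two spin components of (1 - gamma) psi: h = psi_up - A psi_low."""
    return [[psi[s][c] - sum(A[s][t] * psi[2 + t][c] for t in range(2))
             for c in range(3)] for s in range(2)]


def reconstruct(h):
    """Rebuild all four spin components using lower = -A^dagger * upper."""
    lower = [[-sum(A_DAG[s][t] * h[t][c] for t in range(2))
              for c in range(3)] for s in range(2)]
    return h + lower


def hop_projected(u, psi):
    """U (1 - gamma) psi with only two color matrix-vector products."""
    h = project(psi)
    return reconstruct([matvec3(u, h[s]) for s in range(2)])


def hop_reference(u, psi):
    """The same quantity computed naively with the full 4x4 spin matrix."""
    gamma = [[0, 0, A[0][0], A[0][1]],
             [0, 0, A[1][0], A[1][1]],
             [A_DAG[0][0], A_DAG[0][1], 0, 0],
             [A_DAG[1][0], A_DAG[1][1], 0, 0]]
    proj = [[psi[s][c] - sum(gamma[s][t] * psi[t][c] for t in range(4))
             for c in range(3)] for s in range(4)]
    return [matvec3(u, proj[s]) for s in range(4)]
```

Comparing `hop_projected(u, psi)` with `hop_reference(u, psi)` for
random inputs shows agreement to rounding error, while the projected
version performs half as many color matrix-vector products.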
Figure 4. Performance vs. local lattice size.

6. Performance 

Figure 4 shows the performance of the Wilson
operator $\not\!\!D$ without communications, of
$\not\!\!D$ with communications and of the full
inverter. Figure 5 shows the performance for a
fixed $4^3 \times 16$ local lattice as a function
of increasing machine size (here the CG inverter
global sums were done on the torus).
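
The dependence on local lattice size is tied to the surface to volume
ratio of the local lattice mentioned in Section 5. The Python sketch
below computes that ratio for a four dimensional local lattice. It
counts the boundary sites of all eight faces, which is a simplifying
assumption; it does not distinguish the direction that is handled
through common memory, and the function name is illustrative.

```python
from math import prod


def surface_to_volume(local):
    """Ratio of boundary sites (both faces in every direction) to local sites."""
    volume = prod(local)
    # Each direction has two faces, each with volume / extent sites.
    surface = sum(2 * volume // extent for extent in local)
    return surface / volume


for local in [(2, 2, 2, 2), (4, 4, 4, 4), (4, 4, 4, 16), (8, 8, 8, 8)]:
    print(local, surface_to_volume(local))
```

Smaller local lattices give a larger ratio, that is, more communication
per unit of computation.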

7. Dream machines 

Here are some examples of interesting machine
configurations. Of course this is just a
sample of what is possible. A full 1024 node
rack, (8, 8, 16, 2) CPUs, with a local lattice of
(4, 4, 4, 16) sites gives a lattice of
(32, 32, 64, 32) sites. This is appealing for next
generation dynamical zero temperature
calculations. A half rack, 512 nodes, (8, 8, 8, 2)
CPUs, with a local lattice of (4, 4, 4, 4) sites
gives a lattice of (32, 32, 32, 8) sites. This is
suitable for next generation thermodynamic
calculations.

If 16 racks of the BG/L system at IBM Watson
are configured as (16, 32, 32, 2) and a local
lattice with (4, 2, 2, 32) sites is used then one has
a lattice with (64, 64, 64, 64) sites. If the 64 racks
of the BG/L system at LLNL are configured as
(64, 32, 32, 2) and a local lattice of (2, 2, 2, 32) is
used then one has a lattice with (128, 64, 64, 64)
sites. Clearly these are dream machines for QCD.
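
The lattice sizes quoted in this section are the elementwise product
of the CPU grid and the local lattice. The short Python sketch below
reproduces the four examples and the corresponding node counts; the
function names and dictionary labels are illustrative.

```python
from math import prod


def global_lattice(cpu_grid, local):
    """Global lattice extents: CPU grid times local lattice, per direction."""
    return tuple(p * l for p, l in zip(cpu_grid, local))


def node_count(cpu_grid):
    """Number of nodes: the first three grid directions span the torus,
    the fourth runs over the two CPUs of each node."""
    return prod(cpu_grid[:3])


configs = {
    "one rack": ((8, 8, 16, 2), (4, 4, 4, 16)),
    "half rack": ((8, 8, 8, 2), (4, 4, 4, 4)),
    "16 racks": ((16, 32, 32, 2), (4, 2, 2, 32)),
    "64 racks": ((64, 32, 32, 2), (2, 2, 2, 32)),
}

for name, (cpu_grid, local) in configs.items():
    print(name, node_count(cpu_grid), global_lattice(cpu_grid, local))
```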
Figure 5. Performance of the Wilson CG inverter
vs. number of CPUs for a $4^3 \times 16$ local lattice.

8. Physics software 

Most of the high level physics code is the
Columbia C++ physics system (CPS). The full
system ported easily and worked immediately. 

9. Conclusions 

QCD crossed the 1 Teraflops sustained landmark
in June 2004. In the next year, because of
analytical and supercomputer developments,
dynamical QCD will likely get to L/a = 32 at
physical quark masses and perhaps even more...